Expected Time to Completion: 45 minutes
The ape package provides lots of useful, efficient tools for dealing with trees in R. We load it right away.
library("phyloseq"); packageVersion("phyloseq")
## [1] '1.9.11'
library("ape"); packageVersion("ape")
## [1] '3.1.3'
library("ggplot2"); packageVersion("ggplot2")
## [1] '1.0.0'
load("example-data.RData")
ls()
## [1] "closedps" "metadata" "ps0" "qiimedata"
Parts of “phylo” object. It is actually a list
tree = phy_tree(closedps)
class(tree)
## [1] "phylo"
names(tree)
## [1] "edge" "Nnode" "tip.label" "edge.length" "node.label"
The edge matrix
head(tree$edge)
## [,1] [,2]
## [1,] 74 1
## [2,] 74 75
## [3,] 75 2
## [4,] 75 76
## [5,] 76 77
## [6,] 77 78
The edge matrix
tree$Nnode
## [1] 72
tree$tip.label[1:5]
## [1] "169901" "673925" "4470518" "1107945" "4346374"
taxa_names(tree)[1:5]
## [1] "169901" "673925" "4470518" "1107945" "4346374"
tree$edge.length[1:5]
## [1] 0.39755 0.08322 0.38007 0.02085 0.39352
tree$node.label[1:5]
## [1] "" "" "" "" ""
The node labels were empty. Let’s assign simulated bootstrap values to each node label using standard R list assignment semantics.
set.seed(711)
tree$node.label <- round(runif(tree$Nnode, 0 , 1), digits = 3)
All other aspects of a “phylo” object are just as easy to modify. This is both convenient and dangerous. Why?
What about assigning to a tree in a phyloseq object?
Easy! Just use the same list assignment semantics.
phy_tree(closedps)$node.label[1:5]
## [1] "" "" "" "" ""
phy_tree(closedps)$node.label <- tree$node.label
phy_tree(closedps)$node.label[1:5]
## [1] 0.290 0.426 0.325 0.630 0.445
You can even replace the entire tree.
Caution: this may side-step the automatic index checks, like OTU names, that are built-in to the phyloseq function.
phy_tree(closedps) <- tree
phy_tree(closedps)
##
## Phylogenetic tree with 73 tips and 72 internal nodes.
##
## Tip labels:
## 169901, 673925, 4470518, 1107945, 4346374, 4414420, ...
## Node labels:
## 0.29, 0.426, 0.325, 0.63, 0.445, 0.123, ...
##
## Rooted; includes branch lengths.
Let’s check which OTU is considered root in this tree.
is.rooted(tree)
## [1] TRUE
ROOT = tree$edge[(ntaxa(tree) + 1), 2]
# check
sum(tree$edge[, 2] == ROOT)
## [1] 1
sum(tree$edge[, 2] == ROOT+2L)
## [1] 1
rootAnc = tree$edge[(ntaxa(tree) + 1), 1]
tree$edge[(tree$edge[, 2] == rootAnc), ]
## [1] 110 111
Add the node IDs on a “native R” ape tree plot
plot(tree); nodelabels()
Plot the pretend bootstrap values instead of the node ID.
plot(tree); nodelabels(tree$node.label)
plot_treephyloseq defaults are a bit more legible.
plot_tree(tree, ladderize = "left", label.tips = "OTU")
plot_treeCan add symbols to simplify bootstrap values
plot_tree(tree, nodelabf = nodeplotboot(80), ladderize = "left", label.tips = "OTU")
plot_treeCan add symbols representing samples
plot_tree(closedps, nodelabf = nodeplotboot(80), ladderize = "left", label.tips = "OTU",
color = "Treatment", justify = "left")
Convert to relative abundance, then calculate distance.
cpsp = transform_sample_counts(closedps, function(x){x/sum(x)})
distance(cpsp, "unifrac", weighted=TRUE)
## PC.354 PC.355 PC.356 PC.481 PC.593 PC.607 PC.634 PC.635
## PC.355 0.3611
## PC.356 0.2574 0.2234
## PC.481 0.2305 0.2669 0.2411
## PC.593 0.4754 0.3084 0.2929 0.3678
## PC.607 0.4955 0.3206 0.3527 0.3671 0.2079
## PC.634 0.6759 0.4067 0.5343 0.5692 0.4416 0.4429
## PC.635 0.5547 0.3276 0.3980 0.4507 0.3519 0.2942 0.2615
## PC.636 0.6344 0.3621 0.4803 0.5336 0.3839 0.3907 0.1602 0.2147
duf = distance(cpsp, "unifrac", weighted=TRUE)
dufPCoA = ordinate(closedps, "PCoA", duf)
plot_scree(dufPCoA)
Make the w-UniFrac PCoA graphic
plot_ordination(cpsp, dufPCoA,
color = "Treatment", title="w-UniFrac PCoA") + geom_point(size=7)
plot_ordination(cpsp, dufPCoA, type = "split",
shape="Treatment", color="Phylum", title="wUF-MDS Split Plot")
What about alternative ordination method, like NMDS?
How would you encode that?
dpcoa = ordinate(cpsp, "DPCoA")
plot_scree(dpcoa)
plot_ordination(cpsp, dpcoa,
color = "Treatment", title="DPCoA") + geom_point(size=7)
plot_ordination(cpsp, dpcoa, type = "split",
shape="Treatment", color="Phylum", title="DPCoA Split Plot")
There are lots of types of networks.
Here we are discussing a very simple network that represents (dis)similarity values. Its main purpose in our case is for data exploration.
Lots on this subject. See CRAN task view(s)
phyloseq uses the igraph package for its internal network methods.
There are two network plot functions in phyloseq:
plot_netplot_network (legacy, igraph manipulation)plot_net
plot_networkThere is a random aspect to some of the network layout methods. For complete reproducibility of the images produced later in this tutorial, it is possible to set the random number generator seed explicitly:
set.seed(711L)
Because we want to use the enterotype designations as a plot feature in these plots, we need to remove the 9 samples for which no enterotype designation was assigned (this will save us the hassle of some pesky warning messages, but everything still works; the offending samples are anyway omitted).
data("enterotype")
enterotype = subset_samples(enterotype, !is.na(Enterotype))
The newer plot_net function does not require a separate make_network function call, or a separate igraph object. For examples running the older plot_network function, which may provide some added flexibility with igraph objects, see the plot_network section later.
Try plot_net with the default settings.
plot_net(enterotype, maxdist = 0.4, point_label = "Sample_ID")
The previous graphic displayed some interesting structure, with one or two major subgraphs comprising a majority of samples. Furthermore, there seemed to be a correlation in the sample naming scheme and position within the network.
Instead of trying to read all of the sample names to understand the pattern, let’s map some of the sample variables onto this graphic as color and shape:
plot_net(enterotype, maxdist = 0.3, color = "SeqTech", shape="Enterotype")
In the previous examples, the choice of maximum-distance and distance method were informed, but arbitrary.
maxdist value is decreased?? (hint: this will usually decrease the number of edges in the network).